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ON ! Abstract 

We propose a flexible statistical model for high-dimensional quantitative 
prl '■ data on a hypercube. Our model, called the structural gradient model (SGM), 

is based on a one-to-one map on the hypercube that is a solution for an op- 
timal transport problem. As we show with many examples, SGM can de- 
scribe various dependence structures including correlation and heteroscedas- 
c/2 , ticity. The maximum likelihood estimation of SGM is effectively solved by 

the determinant-maximization programming. In particular, a lasso- type esti- 
mation is available by adding constraints. SGM is compared with graphical 
Gaussian models and mixture models. 

Keywords: determinant maximization, Fourier series, graphical model, lasso, 
C ■ optimal transport, structural gradient model. 



^ '• 1 Introduction 
o 

In recent years, it becomes more important to treat high- dimensional quantitative 
^ ■ data especially in biostatistics and spatial-temporal statistics. The graphical Gaus- 

sian model is one of the most important model. However, the Gaussian model repre- 
sents only the second-order interaction without heteroscedasticity. In this paper, we 
introduce the structural gradient model (SGM) that represents both higher-order 
and heteroscedastic interactions of data. The model is defined by a transport map 
that pushes the target probability density forward to the uniform density. The data 
structure is described by the parameters in the trans por t map . This model is a 
practical specification of the gradient model defined in ISeil (120061 ). 



We consider probability density functions on the hypercube [0, l] m written as 

p(x) = det(D 2 ij(x)), xe[0,l} m , (1) 

where ip is a convex function and D 2 ij){x) is the Hessian matrix of ip at x. The 
function p is a probability density function if the gradient map Dip is a bijection on 
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[0, l] m . In fact, by changing the variable from x to y — Dip(x), we obtain 
det(D 2 ip(x))dx = 



[o,i] r 



det 



[0,1]* 



) dx 



dx 



dy 



1. 



[o,i] T 



It is known that any probability density function on [0, l] m (actually on IR m ) is 
written as (ED). This f act is deeply connected to the theory of optimal transport 
see e.g. Ivillanil hood )). The bijective gradient map Dip, called the Brenier map, 



is the optimal-transport plan from the density ([TJ to the uniform density. In this 
paper, we call ijj the potential function. Furthermore, as explained in SectionEl most 
density functions on [0, l] m are characterized by the Fourier series of ip. When ip is 
represented by the Fourier series, we will call the model (0Q) the structural gradient 
model and refer to it as SGM. Unknown parameters are the Fourier coefficients of the 
potential function ip. SGM can describe not only two-dimensional correlations but 
also the three-dimensional interactions and heteroscedastic structures, unlike the 
graphical Gaussian model. We examine this flexibility by simulation and real-data 
analysis. 

The maximum likelihood estimation of SGM is reduced to a determinant maxi- 
mization problem with a robust convex feasible region. In practice, this region is not 
directly used because it is described by infinitely many constraints. We give two dif- 
ferent approaches to overcome this difficulty. First we give a sequence converging to 
the feasible region from the inner side. Secondly we give a /^-conservative region. 
These approaches e nable us to calculate the es timator by the determinant maxi- 
mization algorithm (IVandenberghe et al.l (119981 )). As a by-product of the second 
approach we have a lasso-type estimator for S GM. A related estimator is th e lasso - 
type estimator for gr a phical Gauss i an models (IMeinshausen and Biihlmannl (120061 ). 



Yuan and Lin 



(120071 ). 



Bunea et al. 



(ho07l ). lBanerjee et ali (hoosl n. 
We consider only the case in which the sample space is a hypercube. However, 
this is not a strong assumption because we can transform any real-valued d ata into 
[0, 1] -v alued data by a fixed sigmoid function. Unlike the copula models (INelsen 
(120061 )). the marginal density of SGM does not need to be uniform. Our model can 
still adjust the marginal densities after the sigmoid transform. An other appr o ach 
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to dea l with unbounded data is given by the author's past papers (jSeil (120061 ) 
(120071 )). where optimal transport between the standard normal density and other 
densities is considered. In this paper, we use the uniform density instead of the 
normal density because the former is analytically simpler than the latter. 

This paper is organized as follows. In Section El we define SGM and give various 
examples of it. In Section [HI we investigate the maximum likelihood estimation 
and propose a lasso-type estimator. In Section HI we compare SGM with graphical 
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Gaussian models and mixture models by numerical experiments. Finally we have 
some discussions in Section [51 All mathematical proofs are given in Appendix. 



2 The structural gradient model (SGM) 

In this section, we first give the formal definition and some theoretical properties of 
SGM. Then various examples follow. 

2.1 Definition and basic facts 

Let m be a fixed positive integer. Denote the gradient operator on [0, l] m by D = 
(d/dxi) 7 ^ l and the Hessian operator by D 2 = (d 2 / dxidxj)™j =1 . The determinant of 
a matrix A is denoted by det A The notation A y B (resp. A y B) means that 
A — B is positive definite (resp. positive semi-definite). Let Z> be the set of all 
non-negative integers. 

Definition 1 (SGM). Let U be a finite subset of Z> . We define the structural 
gradient model (abbreviated as SGM) by Eq. (TjQ) with the potential function 

1 9 m 

ip(x\9) = -x T x - ^ -^YlcosiirUjXj), (2) 

ueu j=i 

where x = (xj) G [0, l] m and 9 = (9 U ) G IR W . We call U the frequency set. The 
parameter space of SGM is 

6 = {9 eR u \ D 2 ip(x\9) y 0, VxG[0,l] m }- (3) 
A vector 9 G M. u is called feasible if 9 G G. We also call Q the feasible region. □ 

The following lemma is fundamental. 
Lemma 1. If 9 is feasible, then p(x\9) is a probability density function on [0, l] m . 

SGM has sufficient flex ibility for multivariate modeling because the following the- 



orem by ICaffarellil (120001 ) holds. To state the theorem, we prepare some notations. 
Denote the 2m faces of [0, l] m by Fj = {x G [0, l] m | Xj = b} for j G {1, . . . , m} and 
b G {0, 1}. For a smooth function ip on [0, l] m , we consider a Neumann condition 

dip(x) 



dxj 



b for any x G (4) 



It is easily confirmed that the function ip defined by (j2j) satisfies the Neumann condi- 
tion (jlj). Conversely, if ip(x) satisfies the Neumann condition then it is expanded 
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by an infinite cosine series in L 2 sense (see e.g. page 300 of IZygmundl (120021 )). In 
other words, the function (j2j) approximates any potential function satisfying (J4j) if 
we make the frequency set U large. Now we describe the Caffarelli's theorem. Here 
we put a slightly stronger assumption than his. 



Theorem 1 (Theorem 5 of ICaffarellil (120001 )). Let p(x) be a strictly positive and 



continuously different iable function on [0, l] m . Assume that p(x) satisfies a Neumann 
condition dp(x)/dxj = for any x G Fj. Then there exists a twice-differentiable 
convex function ip{x) such that (CQ) and (j4j) hold. 

Since the conditions for p(x) in the above theorem are differentiability and a 
boundary condition, we can construct sufficiently many statistical models by SGM. 
In the following subsection, we enumerate various examples of SGM. In Section El 
we discuss removal of the boundary condition for p(x) by removing the twice- 
differentiability condition for ip{x). 

For the one-dimensional case (m = 1), SGM becomes a mixture model as will 
be explained in the following subsection. For the multi-dimensional case (m > 1), 
SGM is not a mixture model except for essentially one-dimensional case. 

Lemma 2. SGM is not a mixture model unless there exists some i G {1, . . . ,m} 
such that U C Zj, where Zj = {u G Z> | Uj = Vj ^ i}. 

We use the following mixture model reference. 

Definition 2 (MixM). Let U be a finite subset of Z> . We define a structural 
mixture model (referred to as MixM) by 

m 

p{x\9) = i+j2 e u\\u\\ 2 n cos(7rUjXj), (5) 

ueu j=i 

where x = (xj) G [0, l] m , 9 = (9 U ) G M. u and ||w|| 2 = Y^j=i u "j- ^he feasible region is 
6 := {9 G R u | p(x\0) > Vx G [0, l) m }. □ 

In the following lemma, we prove that SGM and MixM have a common score 
function at the origin 9 = of the parameter space. The Fisher information matrix 
at the origin is also calculated. 

Lemma 3. The score vector at the origin 9 = of both SGM and MixM is equal to 
(\\ u \\ 2 Y[T=i cos ( 7lu j x i)) u ^ u - T ne Fi sner information matrix (J U v)u,veU at the origin 



9 = of both the models is given by 



w|| 4 l{ u =„} 



Juv 2 1^)1 



where a{u) — {j G {1, . . . , m} \ Uj > 0}. In particular, J uv is diagonal. 
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The Fisher information matrix J at the origin is useful if we deal with the testing 
of hypothesis 9 = 0. Under this hypothesis, the maximum likelihood estimator 9 
is approximated by a Gaussian random vector with mean and variance (nJ)^ 1 . 
In Section HI we will use the scaled maximum likelihood estimator J l / 2 9 to detect 
which components of 9 are significant. A method of computation for the maximum 
likelihood estimator is given in Section [31 In general, it seems difficult to calculate 
the Fisher information at the other points 9 ^ 0. Exceptional cases will be stated 
in the following examples. 



2.2 Examples 

We enumerate examples of SGM. We mainly compare SGM with MixM defined in 
Definition [2j For SGM, the following sufficient condition for feasibility of 9 is useful 
to deal with the examples. In Theorem [31 we will show that 9 is feasible if 

> ( 6 ) 

ueu 

for any j = 1, . . . , m. This condition is also necessary if, for example, U is a one- 
element set (see Theorem [3] for details). 

Example 1 (1-dimensional case). If m = 1, then the probability density of SGM is 
given by the Fourier series 

p(xi\9) = 1 + 9 u u 2 cos(nuxi). 
This coincides with MixM (Definition El). The model is conside red as a particular 



case of the circular model proposed by iFernandez-Duranl (120041 ) . If U = {u} with 



some u G Z >0 , then the Fisher information J uu (9) is explicitly expressed for any 
feasible 9 = 9 U . In fact, 

V ; 0V1 - 2 u 4 

The proof is given in Appendix. □ 

Example 2 (Independence). Let m = 2 and 

U = {(ui, 0) | ui G Ux] U {(0, u 2 ) | u 2 G U 2 }, 

where Ui (i = 1, 2) is a finite subset of Z>o. Then SGM becomes an independent 
model 

p( Xl ,x 2 \9) = 1+ 9 
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Independence of higher- dimensional variables is similarly described. On the other 
hand, if we consider MixM 



p(x 1 ,x 2 \9) = 1+ 2J 9{u ua )u\cos{-KUiXi) + 22 d(o,u 2 )ulcos(7ru 2 x 2 ), 

ui£Ui U2&A2 

then x\ and x 2 are not independent except for trivial cases. □ 

Example 3 (Correlation). Let m = 2 and U = {(1,1)}. Then a pair (Xi, X 2 ) drawn 
from p{xx,x 2 \9) has positive or negative correlation if > or < 0, respectively 
(see Figure CD). We confirm this observation by explicit calculation. We denote 
9 = c (0 = cos ( n O an d s(£) = sin(7r£) for simplicity. The density is 

p(x 1 ,x 2 \9) = det( -W X M?\) 
FK ' 1 ; V -9s(x 1 )s(x 2 ) 1 + 9c{x x )c{x 2 ) J 

q2 

= 1 + 29c(x 1 )c(x 2 ) + —(c(2 Xl ) + c(2x 2 )). 

By the condition ([6]), the feasible region for 9 is [—1,1]. The marginal density of Xi 
(i = 1,2) is exactly calculated as 

9 2 

p{xi\9) = l + —c(2x i ). 

The mean and variance of Xj (i = 1, 2) are 1/2 and (1/12) + # 2 /(47r 2 ), respectively. 
The correlation is 

Cov[Xi,X 2 ] 89 /vr 4 969 /ir A 



vAp^Vf^] (1/12)+^ 2 /(4tt 2 ) l + 3e 2 /7r 2 

The maximum correlation over 9 G [—1, 1] is 96/(7r 4 + 37r 2 ) ~ 0.7558 at 9 = 1. In 
contrast, if we consider MixM 

p(xi,x 2 |#) = 1 + 29c(x 1 )c(x 2 ), 

then the feasible region (i.e. the set of 9 that assures p(xi,x 2 \9) > 0) is \9\ < 1/2. 

The correlation is 96$ /7r 4 and its maximum value is 48/7T 4 ~ 0.4928 at 9 = 1/2. 

Thus SGM can describe a distribution with higher correlation than MixM. The 
Fisher information J uu {9) is explicitly expressed for any feasible 9, where u = (1, 1). 
The formula is 

JuM = gWg), (8) 

The proof is given in Appendix. □ 



6 



(a) 9 = 0.5. 



(b) e = -0.5. 



Figure 1: The probability density p(x\9) for U = {(1, 1)} and 9 = 0(i t i) = ±0.5. The 
correlation coefficient is about ±0.458 for 9 = ±0.5, respectively. 

Example 4 (Heteroscedasticity). Let m = 2 and U = {(1,2)}. Then a pair (X 1; X 2 ) 

drawn from p{x\, x 2 \9) has the following property: the conditional mean of X 2 given 

X\ does not depend on X\ but the conditional variance does (see Figure [2]). In other 

words, X 2 has heteroscedasticity in terms of regression analysis. We confirm this 

fact. The joint density is 

. _ / 1 + 9c(x 1 )c{2x 2 ) -29s{x 1 )s{2x 2 ) \ 
P{x 1 ,x 2 \v ) - aeiy _29s(x 1 )s(2x 2 ) 1 + 49c(x 1 )c(2x 2 ) J 

= 1 + 59c(x 1 )c(2x 2 ) + 29 2 c(2x 1 ) + 29 2 c(Ax 2 ) 

where we put c(£) = cos(7r£), s(£) = sin(7r£), and 9 = 9n 2 -\. The marginal density 
of Xi is p{x\) — 1 ± 29 2 c(2xi). The conditional density of X 2 given X\ is 

5#c(x 1 )c(2x 2 ) + 2fl 2 c(4x 2 ) 

P(X2|Xl '^ = 1 + 1 + 2^(2^) 

The conditional mean of X 2 given X\ is exactly 1/2, and therefore the correlation 
between X% and X 2 is zero. However, the conditional variance of X 2 given X\ is not 
constant: 

1 2 / , m , 1 1O0c(xi) + # 2 



(x 2 - 1/2) p(x 2 |a;i, 6 l )dx 2 



o 



12 4tt 2 {1 + 2# 2 c(2x 1 )}' 
In order to measure the dependency of Xi, let us consider the quantity 

E[(X 1 - 1/2)(X 2 - 1/2) 2 ] 



Pl22(0) 



{V[XJ}i/»V[X a ] 

-5^/tt 4 



{(1/12) + 2 /tt 2 } 1 /2{(i/12) + ^/(4tt 2 )} 
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The maximum value of f3\ 22 (9) over the feasible region 9 G [—1/4, 1/4] is /?i 22 (— 1/4) — 
0.5047. In contrast, for MixM p(x\,x 2 \9) = 1 + 59c(xi)c(2x 2 ), the maximum of 
0122(0) over the feasible region 9 G [-1/5, 1/5] is /?i 22 (-l/5) ~ 0.4267. Thus SGM 
can describe more heteroscedastic distributions than MixM. The heteroscedasticity 
appears in regression analysis, where explanatory and response variables are a priori 
selected. Remark that our model does not need a priori selection of variables. □ 



Figure 2: The probability density for U = {(1,2)} and 9 = 0.2. The conditional 
density p(x 2 \xi) is unimodal if X\ is close to 1, and bimodal if X\ is close to 0. 

Example 5 (three-dimensional interaction). Let m = 3 and U = {(1,1,1)}. Then 
the triplet (X 1; X 2 , X 3 ) has the three-dimensional interaction although the marginal 
two-dimensional correlation for any pair vanishes. We confirm this. The joint prob- 
ability density is 



where q = cos(7TXj) and Si = sm(nxi) for i = 1,2, 3. The density is symmetric 
with respect to permutation of axes. The feasible region is \9\ < 1 by ([6]). The 2- 
dimensional and 1-dimensional marginal densities aiep(xi, x 2 \9) = l+9 2 (<ic\c\ — l)/2 
and p(x\\9) = l + 9 2 (2c\ — l)/2, respectively. In particular, the mean of Xj is 1/2 and 
the correlation of Xj and Xj (i ^ j) is zero. However, there exists three-dimensional 
interaction between (Xi,X 2 ,X 3 ). We calculate 




p(x 1 ,x 2 ,x 3 \9) 



1 + 3#cic 2 c 3 + W 2 c\c\c\ + 9 z c\c\c\ 

r,n3 2 2 2 /i , n \q2/ 2 2 2, 2 2 2, 2 2 2\ 

—29 CiS l c 2 s 2 c 3 s 3 — (1 + 9ciC 2 c 3 )9 {c l s 2 s 3 + s l c 2 s 3 + s 1 s 2 c 3 J, 



0123(0) 



E[(X X - EXx)(X 2 - EX 2 )(X 3 - EX 3 )] 



Vvtx^vxjvx,] 
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The result is 

. ... -249/tc 6 - 1944fl 3 /729vr 

Pl23\9) = 



1/12 + # 2 / (4tt 2 )) 3 / 2 ' 
The maximum value of (3\ 23 {9) over the feasible region \9\ < 1 is fti 23 (— 1) — 0.7743. 
In contrast, for MixMpfa, x 2 , x 3 \9) = l+3#CiC 2 c 3 , we have (3 123 (9) = -288y/U6/n 6 . 
Its maximum value over the feasible region 16*1 < 1/3 is about 0.3459 at 9 = —1/3. 

□ 

Example 6 (Approximately conditional independence). Let m = 3 and (X\, X 2 , X 3 ) 
be drawn from a probability density p(x\, x 2 , x 3 ). In general, conditional indepen- 
dence of X\ and X 2 given X 3 is described by p(xi,x 2 ,x 3 ) = p(x 3 )p(xi\x 3 )p(x 2 \x 3 ) 
or, equivalently, the conditional mutual information 

/ 12 | S = /rf*., *,.*,) log /^'fff A .d^ 
J p{xi\x 3 )p{x 2 \x 3 ) 

vanishes. A log-linear model exp(/(xi,x 3 ) + g(x 2 , x 3 )) satisfies this condition. Al- 
though SGM does not represent any conditional-independence model, we can con- 
struct an approximately conditional-independence model. Let m = 3 and U = 
{(1, 0, 1), (0, 1, 1)}. Then, by putting q = cos(7TXj), Sj = sin(7TXj), 9 = #(i,o,i) and 
— ^(o,i,i); we have 

p(x 1 ,x 2 ,x 3 \9,(f)) 

I 1 + 6dc 3 -9s x s 3 
= det 1 + 4>c 2 c 3 —(f>s 2 s 3 

\ -9sxs 3 -<ps 2 s 3 1 + 9c x c 3 + 4>c 2 c 3 

= 1 + 29c x c 3 + 20c 2 c 3 + 39(f)c 1 c 2 c 2 3 + 9 2 {(? l c? z - s 2 ^) + 4> 2 (c 2 2 c 2 3 - sls 2 3 ) 
+9 2 4>(clcl - sjsl)c 2 c 3 + 94> 2 {c\cl - s 2 2 sl)cic 3 

Now assume that e := max(|#|, |0|) is close to zero. Then the conditional mutual 
information is, after tedious calculations, 

3 

16 

On the other hand, MixM p{x\, x 2 , x 3 \9, <f>) = 1 + 29cic 3 + 2(pc 2 c 3 has the conditional 
mutual information Ji2|3 = (3/4)# 2 2 + 0(e 5 ). The leading term is 4 times larger 
than that of SGM. □ 

We summarize the above examples in Table [H 

Example 7. We can construct more complicated densities by combining the pre- 
ceding ones. For example, let m = 3 and U — {(1, 2, 0), (0, 1, 1), (1, 1, 1)}. Let the 
corresponding parameter vector be 9 = (0.1,0.3,0.2). The vector 9 is feasible since 
([6]) is satisfied. The marginal and conditional 2-dimensional densities are illustrated 
in Figure [3j □ 



/i2| 3 = -9 2 J> 2 + 0(e 5 " 
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Table 1: Summary of the examples. For each example, the characteristics of SGM 
and MixM are compared. 



# 


Model name 


m 


Characteristic 


SGM 


MixM 


1 


1-dim. 


1 


(SGM=MixM) 






2 


independence 


2 


'is independent' 


TRUE 


FALSE 


3 


correlation 


2 


maximum correlation 


0.7558 


0.4928 


4 


heteroscedasticity 


2 


maximum f3\ 22 


0.5047 


0.4267 


5 


3-dim. interaction 


3 


maximum (5\ 23 


0.7743 


0.3459 


6 


conditional independence 


3 


leading coefficient of 7" 12 3 


3/16 


3/4 




(a) p(x 1 ,x 2 ) 



(b) p(x!,x 3 ) 



(c) p(x 2 ,x 3 ) 




x3 <3 



(d) p{x X) x 3 \x 2 = 3/4) (e) p(x u x 3 \x 2 = 1/4) 

Figure 3: The marginal and conditional densities forW = {(1, 2, 0), (0, 1, 1), (1, 1, 1)}. 
The figures (a), (b) and (c) are the marginal density p(xi,Xj) for each pair 
The figures (d) and (e) are the conditional density p(xi,x 3 \x 2 ) for specific values of 
x 2 . 
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3 Maximum likelihood estimation of SGM 



Let x(l), . . . , x(n) be independent samples drawn from the true density po(x) whose 
support is [0, l] m . From the definition of SGM, the maximum likelihood estimation 
of SGM is formulated as a convex optimization program: 

maximize log det I I + 9 u H u {x{t)) 

I + J20uH u (Oh0 V^G[0,l] m l 
ueu J 



subject to 9 G = < 9 G 



where we put H u (x) = D 2 (— n~ 2 YYp=i cos(nu p x p )) . Recall that D 2 is the Hessian 
operator and U is a finite subset of Z> . 

It is hard to write down explicitly. The difficulty follows from the statement 
"for any £ G [0, l] m " in the definition of 0. In general, for a set of feasible re- 
gions 0„ indexed by a , the region n a a is called a robust feasible region (see 
Ben-tal and Nemirovskil (11998I )). 

We consider two approaches to solve this problem. We will first give a sequence 
Q° M of regions converging to 0°, the interior of 0, as M — > oo. Hence the maximum 
likelihood estimator is calculated with arbitrary accuracy in principle. However, 
<d° M has about M m constraints on 9 and therefore it is usually expensive if m > 3. 
For the second approach, we give a proper subset of 0, which consists of only 
m constraints. As a by-product of the second approach, we obtain a lasso-type 
estimator because ht is compatible with /^-constraints. We call the maximizer of 
the log-likelihood over these constrained regions the constrained maximum likelihood 
estimator. The constrained maximum likelihood estimator is calculated via the 



determinant maximization algorithm (jVandenberghe et al.l (1l998l )). 

If m = 1, the feasible region is the set of Fourier coefficients of non-negative 

(120041 ) used Fejer's 



Fernandez-Duran 



functions. To deal with the feasible region, 
characterization: the Fourier series of any non-negative function is written as the 
square of a Fourier series. More specifically, for any r(x) = Y^Lo r « cos(tiux), its 
square r(x) 2 is of course non- negative and written by a Fourier series. The Fourier 
coefficients of r(x) 2 are written by quadratic polynomials of (r u )^ =0 . However, it is 
hard to use this representation for our problem because we assume 9 U = for u ^ U 
and this restriction is not affine in r u . 
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3.1 Inner approximation of feasible region 

Let 0° be the interior of 0. We give a sequence of tractable sets 0^ that converges 
to 0° from inside as M —>■ oo. We first remark the following lemma. 

Lemma 4. The set 0° is equal to {6 eR u \ D 2 ip(x\9) y Vx G [0, If 1 }. 

We prepare some notations for constructing Q° M . We consider the lattice points 
Lm, where L M = {-|, j } , ■ ■ ■ , |f }. Let \\u\loc = maxj \uj\ and U max = max ueU ||«|U- 
Define a linear operator K M on R u by {K M 9) U = 9J YYJLii 1 ~ U j/ M ) f o r ^K" 
Finally, we define 0^ for each M > U max + 1 by 

e° M = {6 e R u | D 2 i>(Z\K M e) y 0, V£ e } . 

Remark that 0^ is written in a finite number of constraints, in contrast to 0° and 
0. We have the following theorem. 

Theorem 2. For any M > £/ max + 1, we have Q° M C 0° and 

0° = limsup©^, 

where limsupj^..^ 0^ is defined by Dm'>i ^m>m' ©m- 

The constrained maximum likelihoo d estimator of 9 over Q° M is calculated via the 
determinant maximization algorithm (jVandenberghe et al.l (119981 )). Hence, in prin- 
ciple, we can calculate the maximum likelihood estimator with arbitrary accuracy. 
However, the region 0^ consists of \Lj^\ = (M + l) m constraints. This number is 
usually expensive if m > 3. In the following subsection, we give a proper subset of 
which consists of only m constraints. 

Example 8. Let m = 2 and U = {(1, 1), (2, 2)}. The approximated regions Q° M 
(M = 5, 10, 20, 40) are illustrated in Figure H] (a). For this case, we can give a precise 
expression of 0. The two eigenvalues of the Hessian matrix D 2 if)(x\6) are given by 

A± = 1 + 0(1,1) cos(7r(xi ± x 2 )) + 40(2,2) cos(27r(xi ± x 2 )). 

In the theory of time-series analysis, the function f(z) := 1 + cos(kz) of z is 

the spectral density of a MA(k) process with the autocorrelation coefficients (pj) 1 ^^. 
In particular, for MA(2), it is known that f(z ) is non- negative f or any z if and only 
if \pi \ + |/02 1 < 1 or p\ < 4p 2 (l — p-i) holds (see lBox and Jenkins! (119761 ) . Section 3.4). 
Therefore the feasible region for U — {(1, 1), (2, 2)} is given by 

10(1,1)1 + 140(2,2)1 < 1 Or (0(1,1)) 2 < 4(40(2,2))(1 - 40(2,2))- 

The region 0^ shown in Figure |4] (a) is close to this region. We also illustrate the 
approximated regions for another example U = {(1, 1), (3, 1)} in Figure 0] (b). □ 
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th_(1,1) 



thJ1,1) 



(a) U = {(1,1), (2, 2)}. 



(b)W = {(l,l),(3,l)}. 



Figure 4: The approximated region 0^ (solid line; M = 5, 10, 20, 40 from inner 
side) and the little parameter space 9 bt (dashed line) defined in Subsection 13.21 

We remark that the feasible region for MixM (Definition [2]) is approximated from 
the inner side by 



The proof is similar to that of Theorem [2] and omitted here. 

3.2 A conservative region and Lasso- type estimation 

We give a sufficient condition such that 9 G 0. Define a set ht by 



We call the little parameter space. It is an intersection of m constraints. In 
the following theorem, we show that the little parameter space ht is a subset of 
the feasible region 0. In other words, ht is more conservative than in the sense 
of robustness. We say that a subset V of U is linearly independent modulo 2 if a 
linear map £ : {0, 1} V h- > {0, l} m defined by £(e) = ^ ugV e u w (mod 2) has the kernel 
{0}. For each VcW, the set of vectors that have only V-components is denoted by 
K v = {9 g R u I 9 U = Vu £ V}. 

Theorem 3. For any U, bt C 0. Furthermore, if a subset V of U is linearly 
independent modulo 2, then we have bt n My = H My. In particular, if U itself 
is linearly independent modulo 2, then ht = 0. 



M 



{9 g R' 



p(t\K M 9) > 0, for f G LZ) 







.lit 
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By letting V be a one-element set {u}, we have the relation G flIR{ M } = 0nl{ u }. 
This shows that G ht contains at leat 2\U\ boundary points of G. The little parameter 
space for IA = {(1, 1), (2, 2)} and U = {(1, 1), (3, 1)} is indicated in Figure H] (a) and 
(b), respectively. 

The constrained maximum likelihood estimator of 9 over G ht is computed via the 
determinant maximization algorithm by introducing non-negative slack variables 9^ 
and 9~ such that 9 U = 9+ — 9~ and \9 U \ = 9+ + 9~ ■ T he estimato r is us ually sparse. 
This sparsity is closely related to the lasso estimator iTibshirani ( 1996 ) in that the 
regression method is executed with /^-constraints. Our little parameter space G ht 
is also represented by /^-constraints. Hence we call the constrained maximum like- 
lihood estimator of 9 over G ht the lasso-type estimator for SGM. Furthermore, we 
will use an indexed set Gji* with a tuning parameter r e [0, 1] by 



G 



lit 



9 e 



-5>>f>0 (Vj = l,...,m) ). 

ueu 

In particular, Qq 1 * = {0} and Q l f = Q ht . The tuning parameter r can be selected by 
cross validation. 

We remark that the feasible region for MixM (Definition [2]) has the following 
conservative region 



G 



lit 



9 e 



\u\ 



> 



Furthermore, if a subset V of U is linearly independent modulo 2, then we have 
= G fl Ry. The proof is similar to that of Theorem [3] and is omitted here. 



G n 



Recently, lass o-type estimators for graphical Gauss i an m odels are proposed by 



several authors: 



Yuan and Linl (120071 ) . Banerjee et all (I2008h and 



Friedmann et al 



( 120081 ) . On the other hand, a spa r se de nsity estimation (SPADES) for mixture 



models is considered in 



Bunea et al. 



(120071 ). Our MixM is considered as a version of 
SPADES although the estimation procedure is different. In Section HJ we compare 
SGM with MixM and the graphical Gaussian model by numerical examples. 



4 Numerical examples 

We give numerical examples on simulated and real datasets. We calculate the con- 
strained maximum likelihood estimator and study its predictive performance. We 
compare SGM with the graphical Gaussian model (with lasso) and MixM (Defini- 
tion Wj- 
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We describe some notations and assumptions. We use the following frequency set 
for SGM throughout this section: 



U = {ue Z> | IHloo < 2, < 3} , (9) 

where ||w||oo = maxj \v,j\ and Hull! = £\ \v,j\. The elements of U are given by 
(1,0,. ..,0), (2,0,. ..,0), (1,1,0,. ..,0), (2,1,0,. ..,0), (1,1,1,0,. ..,0) and their 
permutations of the components. The cardinality of U is m(m + l)(m + 5)/6. Let 
§m = 0m u )ueu an d ^r* = (^Tit)uew denote the constrained maximum likelihood 
estimators of 9 over the regions Q° M and G^, respectively (see Section [3] for the 
definition of Q° M and 9^). We call 9^ the lasso- type estimator of SGM. The same 
notations on the estimators are used also for MixM. 



The graphica 



Gau ssian lasso estimator C = G{r) of the concentration matrix 



(jYuan and Linl (120071 )) is formulated as follows 



min. {logdet(C) +tr(£C)} s.t. |Cy | < tJ^ 



>3h 

i<j i<j 



where S is the sample correlation and the tuning parameter r ranges over [0, 1]. If 
t = 1, the graphical Gaussian lasso estimator coincides with the maximum likelihood 
estimator (this is not the case for the lasso- type estimators of SGM and MixM). The 



partial correlation coefficient of X{ and Xj is estimated by pij = —Cij/ y CuCjj. 

For given raw data {D ti )i< t < n l <i< m , we preprocess it before estimation. For Gaus- 
sian models, we use the data D u scaled by the standard way: 



D ti = Dt ' n a \ D 4 = - f^Du, sdp.,) = 

sd{D.i) nj^ \ 



n 
t=i 



1 n 



For SGM and MixM, the data is further transformed into X ti = Q(D ti ), where $ is 
the standard normal cumulative distribution function, in order that Xu ranges over 
[0, 1]. By the transform <3>, the standard normal density as the null Gaussian model 
is transformed into the uniform density as the null SGM and the null MixM. 

We used th e package SDPT 3 for solving the determinant-maximization problem 
on MATLAB fiToh et all faxA . 



4.1 Simulation 

We first confirm that the maximum likelihood estimator is actually computed by 
the method described in Section [3j Consider Example [7| of Subsection 12.21 The true 
parameter is #(1,2,0) =0.1, #(0,1,1) = 0.3 and #(1,1,1) = 0.2 with the true frequency 
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set U Q = {(1, 2, 0), (0, 1, 1), (1, 1, 1)}. The frequency set 
written in a matrix form 



we use for estimation is 



1201201012010010 

U = (0011122000112001 

0000000111111222 



(10) 



The columns are arranged according to the lexicographic order. A result of estima- 
tion is given in Figure [5j The sample size is n = 100 and the number of experiments 
is 100. The samples were generated by the exact method of ISeil (120061 ). Both esti- 
mators actually distribute around the true parameter. 



t i 



J * 1 * 



liii 



(a) VJ^JIlu (M = 5). 



(b) (r=l). 



Figure 5: A simulation of estimation of SGM. The box-plot shows each component of 
the constrained maximum likelihood estimators (a) 9° M for M = 5 and (b) 8^ for r = 
1. The values are normalized by the square root \J J uu of the Fisher information. The 
horizontal axis denotes u EU arranged according to (ITU]) . The dashed line denotes 
the true parameter. The sample size is n = 100 and the number of experiments is 
100. 



We next compare SGM with MixM and Gaussian models. We consider a five- 
dimensional example. Let 4>(x\fi, £) denote the normal density with mean \i and 
covariance S. Let m = 5 and define the true density po{x) by 



Po(x) = (f)(x 1 \0,l)(f)(x2\x 1 ,l)(j)(x 3 \0,al(x2))(t){x4,x 5 \0,E4 5 (x 3 )), 



where 



<J \{ x 'i) = 1 + tanh(x2) and £45(0:3) 



1 tanh(x3) 
tanh(xs) 1 



By the definition, the set of variables (£1,22) has positive correlation, the vari- 
able £3 has heteroscedasticity against £2, and the set of variables (£3, £4, £5) has 
three-dimensional interaction. Remark that the density does not belong to SGM. 
A numerical result is shown in Table [2j The sample size is n = 40 and the number 
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of experiments is 200. All of the three models detected the correlation of the pair 
(xi,X2)- However, only SGM effectively detected the heteroscedasticity of (x2,xs) 
and the three-dimensional interaction (x$, 24,0:5). The estimator of MixM was too 
sparse, and did not effectively detect them. 

For the same true density, we also computed the predictive performance of the esti- 
mators of SGM, MixM and Gaussian. We use the expected predictive log-likelihood 
as the index of the predictive performance. The arbitrary constant of the log- 
likelihood is determined in such a way that the log-likelihood of the null model is 
zero. The sample size is n = 40 for observation and 10 for prediction. The number 
of experiments is 200. The maximum mean predictive log-likelihood of SGM is esti- 
mated as 3.37(±0.33) at r = 1.0, where the confidence interval is based on the 95% 
interval with the normal approximation. For MixM and Gaussian, the maximum 
value is estimated as 1.99(±0.15) at r = 1.0 and 2.72(±0.26) at r = 0.32, respec- 
tively. Hence SGM has better predictive performance than MixM and Gaussian. 



4.2 Real dataset 



We consider th e digoxin clearance data reported in lHalkin et al.l (119751 ) (see also 
Edwardsl (120001 )). The data consists of creatinine clearance (xi), digoxin clearance 
(x%) and urine flow (x 3 ) of 35 patients. In Table [3j we compare the lasso-type 
estimators of SGM, MixM and the Gaussian model. The result shows that for 
the data our SGM gives slightl y better predict ive performance than MixM and the 
Gaussian models. As stated in 



Ed wards N2OO0L partial correlation of (xi,x 3 ) is not 
significant. However, our model suggests a heteroscedastic effect of x\ (creatinine 
clearance) against x 3 (urine flow). 



5 Discussion 



We defined SGM as a set of the potential functions ip and studied its feasible region 
to calculate the constrained maximum likelihood estimator. SGM was applied to 
both simulated and real dataset. We discuss remaining mathematical and practical 
problems. 

We used the finite Fourier expansion to define the potential function ip as Eq. (j2j) . 
It is sometimes hard to describe local behavior of the density function if we use this 
expansion. For such purposes, we can use wavelets instead of the cosine functions 
as long as the resultant potential function satisfies the Neumann condition (Tj0). For 
example, assume that we want to describe tail behavior of two-dimensional data 
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Table 2: Mean value of the lasso-type estimators for the five-dimensional data. 
The tuning parameter r is set to 1. The sample size is n = 40 and the number of 
experiments is 200. The confidence interval is based on the 95% interval with the 
normal approximation. For SGM and MixM, only top ten values of \/~J uu 9^ t u are 
shown. For the Gaussian model, u is the indicator vector of a pair (i, j). 



SGM 


MixM 


Gaussian 


u 




u 




u 




(1,1,0,0,0) 


0.510 (±0.013) 


(1,1,0,0,0) 


0.123 (±0.006) 


(1,1,0,0,0) 


0.706 (±0.011) 


(0,0,1,1,1) 


-0.297 (±0.017) 


(0,1,2,0,0) 


-0.031 (±0.005) 


(1,0,0,0,1) 


-0.023 (±0.023) 


(0,1,2,0,0) 


-0.232 (±0.015) 


(0,0,1,1,1) 


-0.007 (±0.003) 


(0,1,1,0,0) 


0.014 (±0.023) 


(0,0,2,0,0) 


-0.106 (±0.014) 


(0,0,2,0,0) 


-0.006 (±0.002) 


(1,0,0,1,0) 


-0.010 (±0.022) 


(2,0,0,0,0) 


-0.095 (±0.011) 


(0,2,0,0,0) 


-0.002 (±0.001) 


(0,1,0,0,1) 


0.008 (±0.024) 


(0,2,0,0,0) 


-0.084 (±0.010) 


(1,0,2,0,0) 


-0.002 (±0.001) 


(0,0,0,1,1) 


-0.007 (±0.028) 


(0,0,0,0,2) 


-0.043 (±0.013) 


(2,0,0,0,0) 


-0.001 (±0.001) 


(0,1,0,1,0) 


0.007 (±0.024) 


(0,0,0,2,0) 


-0.043 (±0.010) 


(0,2,0,1,0) 


-0.000 (±0.001) 


(0,0,1,1,0) 


-0.006 (±0.023) 


(1,0,2,0,0) 


-0.036 (±0.009) 


(0,0, 1,0,2) 


-0.000 (±0.001) 


(1,0,1,0,0) 


-0.004 (±0.021) 


(0,0,0,2,1) 


-0.015 (±0.015) 


(0,0,0,0,2) 


-0.000 (±0.001) 


(0,0,1,0,1) 


0.004 (±0.023) 



Table 3: A result for the digoxin data. The lasso-type estimators of SGM, 
MixM and the graphical Gaussian model are shown. Only non-zero values are 
displayed. For the Gaussian model, the estimated partial correlation of the pairs 
{1, 2}, {1, 3}, {2, 3} is displayed on the row u = (1, 1, 0), (1, 0, 1), (0, 1, 1), respec- 
tively. The cross-validated predictive log-likelihood (referred to as CV prediction) 
is put on the bottom. For each model, the asterisk V indicates the optimal tuning 
parameter selected by CV prediction. 





SGM 




MixM 




Gaussian 




r = 0.5 r 


= 1.0* 


r = 0.5 T-- 


= 1.0* 


r = 0.25* t = 1.0 


(1,1,0) 


0.351 


0.558 


0.177 


0.354 


0.480 0.758 


(0,1,1) 


0.149 


0.301 






0.217 0.485 


(2,0,1) 




-0.166 








(1,0,1) 


0.149 


0.148 






-0.191 


u (0,0,2) 


-0.070 


-0.147 








(0,2,0) 




-0.088 








(1,0,2) 




0.072 








(0,0,1) 


0.073 


0.050 








(0,1,2) 




-0.039 








CV prediction 


11.19 


14.54 


6.95 


12.26 


14.49 -0.92 
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around x = (1,1). Then we can use a function 

i>(x\9, a) = (xl + x\)/2 + 7r~ 2 6>(2 + cos(vrxi) + cos(vrx 2 )) a , 

where a > 1/2. A typical shape of the density function p(x\8, a) = det (D 2 ip(x\9, a)) 
is given in Figure El One can confirm that the gradient map Dip is continuous on 
[0, l] 2 and satisfies the Neumann condition (j3J). A sufficient condition for convexity 
of if) is < 9 < 2 1 ~ 2a /a. If a < 1, then the tail behavior of p(x\9, a) is 

p(x\9,a) ~ 9 2 a 2 (2a-l)(^-{(l-x 1 ) 2 + (l-x 2 ) 2 } 

as (xi,x 2 ) — > (1,1). The proofs of these facts are omitted. Although estimation of 9 
is described by the determinant maximization, that of a is not. Further investigation 
is needed. 




Figure 6: The density function p(x\9, a) for a = 0.75 and 9 = 2 1 2a /a. 

If any covariates are available together with given data, we can include the co- 
variates in the parameter 9 of SGM. However, since the parameter space of SGM 
is not the whole Euclidean space, its use is restricted. 

The author recently proved an inequality on Efron's statistical curvature, in that 
the curvature of SGM at the origin 9 = is always smaller than that of MixM ([5]). 
This fact is not so practical but it supports SGM. Since the statement and the proof 
of this inequality are rather complicated, we will present them in a forthcoming 
paper. 

We constructed a lasso-type estimator on SGM as a byproduct of the conservative 
feasible region in Section [3l Performance of the estimator is numerically studied in 
Section HI For the existing lasso estimators, some asymptotic results are known when 
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the sample size n and/or the numb e r m of variates increase (IKnight and Fu 



Bunea et al. 



2000), 



(120071 ) . iBaneriee et al 



Meinshausen and Biihlmannl (120061 ). lYuan and Lin! (120071 ). 

( 20081 )). We think it is important to compare our SGM with the Gaussian, mixture 
and exponential models on the asymptotic argument. 



A Proofs 

A.l Proof of Lemma [L] 

Let ip have the form (j2J) and choose any 9 such that D 2 ip(x\6) y for every x G 
[0, l] m . We prove that the gradient map Dip(-\8) is a bijection on [0, l] m . If 9 = 0, 
then the bijectivity of Dip(x\6) = x is clear. Therefore we assume 6^0. We can 
extend the domain of ip(-\0) from [0, l] m to whole W 71 by using Eq. (jSJ), and denote 
the extended function by ip(x) = ip(x\9) for x G IR m . Since ip(x) is a periodic and 
even function along each axis, the convexity condition D 2 ip y holds over x G M. m . 
We will prove that (i) Dip is a bijection on W 71 and (ii) Dip is a bijection on each 



hyperplane {x 



b}, where j G {!,..., m} and b G {0, 1}. We first show that the 



bijectivity on [0, l] m follows from the conditions (i) and (ii). Indeed, if (i) and (ii) are 
fulfilled, then for each j G {1, . . . , m} the sandwiched region {x G M m | < Xj < 1} 
between two hyperplanes is mapped onto itself because Dip is continuous. Therefore 
[0, l] m is injectively mapped onto itself. To prove (i), it is sufficient to show that ip is 
strictl y convex and co- fin ite: lim^oo ^(^Vll^ll = whenever (see Theorem 

26.6 of lRockafellerl fjl97oh ). We define a function / (z) of z Glby/(z) =ip{x + ze), 
where Xq G M m and e G IR m \ {0} are arbitrary. Then f"(z) > for any z since 
D 2 ip(x) y for any x G M m . However, since f"{z) is a non-constant analyitc 
function (recall that 9^0), f"(z) must be positive except for a finite number of 
z for each bounded interval. Hence /, and therefore ip, is strictly convex. The 
co-finiteness of ip is immediate because ip is sum of x T x/2 and a bounded function. 
Hence (i) was proved. Next we prove the condition (ii). We consider the hyperplane 
{x | x m = b}, where b G {0, 1}, without loss of generality. Denote the restriction of 
ip to {x | x m = b} by ipm-i- Then ip m -\ has the following expression 



~2 + 2 



1 m—l 



«„(-!) 



m—l 

n 



COS{TTUjXj / 



i=l 



This function is the same form as Eq. §2§ with the dimension m — l. The convexity 
condition (d 2 ip m _i/dxidxj) y is also satisfied because ip m -i is a restriction of ip. 
Thus (ii) is proved in the same manner as the proof of (i). 
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A. 2 Proof of Lemma [2] 

A statistical model is a mixture model if and only if all the second derivatives of the 
density function with respect to the parameter vanish. Hence we calculate the second 
derivative of the density function of SGM. Put Zj := {u G Z> | Uj = Vj ^ i}. If 
U C Zj for some i, then it is easy to confirm that SGM becomes a mixture model 

p(x\9) = 1 + 9 u Uj cos(irUiXi). 

Hence we assume that U <£_ Zj for any i. Then there exist u,v G W (the case u = v 
is available) such that |cr(w) U cr(f)| > 2, where a(u) = {j \ Uj > 0}. Putting 
A u = {D 2 tfj(x\9)y 1 {d/d9 u (D 2 ilj(x\9))} we have 

d 2 p(x\6) 

— — — — — = tr A u tr A„ — tr[y4 u A„J. 
o9 u 89 v 

Since AJ e=0j:r=0 = diag(u?, . . . , w^), we have 
<9 2 p(a;|#) 



I l|2|| ||2 

\u\\ \\v\\ 



5> 2 ^ = EE u H 2 > °> 

e=o,x=o j i 



where the last inequality follows from \a(u) Ua(v)\ > 2. Thus SGM is not a mixture 
model as long as U <f_ Zj for any z. 

A. 3 Proof of Lemma(3] 

The score function of SGM at 9 = is directly calculated as 

d 



09 



\ogp(x\9) 



\u 



e=o j=1 



f JJcOs(7TM j X j ). 



The score function of MixM is also easily proved to be L u . Then the Fisher infor- 
mation matrix of both the models is 



p(x\0)L u L v dx = ||u|| 2 ||t> || 2 TT / cos(iriijXj) cos(iTVjXj)dxj. 

j=1 Jo 

Here the integral is calculated by the following formula 

j [ 1 if Uj = Vj = 0, 

cos(7TUjXj) cos(iiVjXj)dxj = < 1/2 if Uj = Vj > 0, 



if Uj 7^ v 



21 



A. 4 Proof of Equations (ED and (|8 



We first prove Eq. ([7]). Let m = 1 and U = {u}. We only consider the case u — 1. 
The other cases are similarly proved. Put # = 9\. Since ^(xil^) = 1 + 6'cos(7rxi), 
we have 

J 1 + 6>cos(7rxi) 
By putting z = exp(i7ru£i), we obtain 

T (9) —I (^ + ^ 1 )V4 dz j_r (z 2 + l) 2 

uu{ ' " 2mJ lzl=1 l + e(z + z^)/2z " Amf lzl=1 z 2 (9z 2 + 2z + 9) *' 

The poles of the integrand inside the unit circle are and z + , where z± := (-1 ± 
\J\ — 6 2 )/6. By the residue theorem, we obtain 



2\e 2 j 2z 2 + e(z+-z„) 9 2 VT^¥ ' 

This proves Eq. (J7j). 

We next prove Eq. ([8]). Put u = (1, 1) and 9 = 9 U . We use the following identity 

(x\9) det ( 1 + ^ cos ( a;i ) cos ( x 2) — 0sin(a;i) sin(x 2 ) 
I — 9 sin(xi) sin^) 1 + 9 cos(a;i) cos(x2) 
= (1 + 6'cos(7r(a;i - x 2 )))(l + 9cos(n(x 1 + x 2 ))). 

The Fisher information is 

J (9) = [ ( cos2 ( n ( x ^- x ^) + cos 2 {7r{x 1 + x 2 )) \ 
J[o,i] 2 \1 + ^cos(7r(a;i - x 2 )) I + 9 cos(n(x 1 + x 2 )) J 

cos 2 (it (xi - x 2 )) cos 2 (7r(xi + x 2 )) . 

+ ^ 1—, rr dxidx 2 



4 7[_i i i]2 \1 + 9 cos(7r(xi — x 2 )) 1 + 9 cos(7r(xi + x 2 )) 
cos 2 (tt|/i) cos 2 (vn/ 2 , > , , 



4 7[-i i i]2 \ 1 + 9 cos(-Kyi) 1 + # cos(ny 2 ] 

where the last equality follows from the transformation yi = x\—x 2 and 2/2 = x\ + x 2 , 
and from the periodicity of the integrand. Then ([8]) is proved in the same manner 
as the proof of ([7]). 

A. 5 Proof of Lemma [4] 

We use the following elementary lemma. Put S = {A >z | tr A — 1}. Note that S 
is compact. 

Lemma 5. Let X be a real symmetric matrix. Then the minimum eigenvalue of X 
is given by min^ e5 ti(AX). 
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Proof. Let X = ^e(i)e(z) T be the spectral decomposition of X, where £1 < • • • < 
£ m and e{i) T e{i) = 1. For any A G S, 

tr(AX) = J2tMi) T Mi)) > £i$>(j) T Ae(j)) = 6- 

The equality is attained at A = e(l)e(l) T . □ 
Let H u (x) = D 2 (— n~ 2 YljLi cos(iTUjXj)). Then 

D 2 i>(x\9) = I + J2 e uH u (x). 

The minimum eigenvalue /Wi(#) of D 2 ip(x\9) minimized over x G [0, l] m is 
Pmin(0) = 1+ min y^9 u tr(AH u (x)). 

u<=U 

Recall that the parameter space 6 is expressed as = {9 G M u \ p m \ n {9) > 0}. We 
prove that the interior of 9 is 0° = {9 G R u \ p m m{9) > 0}. Put 

a = max max max\tT(AH u (x))\ < oo. 

u&A xG[0,l] m Aes 

We first prove that if p m i n (9) > 0, then 9 G 0°. Indeed, if rj G M w is sufficiently 
small, then 

Pmm(9 + 1l) > p min (9) - PL 22 M > 0- 

We next prove that if p m i n (^) = 0, then 6> G \ 0°. Since p m i n (9) = 0, there exist 
some A G S and some x G [0, l] m such that tr(A-D 2, ?/>(a;|#)) = 0. For such an x, 
there exists some v G U. such that v tr(Aff„(a;)) < 0. Define a vector 77 G M w by 
r^u = 6 I „1{ U=1 ,}. Then, for any e > 0, we have 

Pmin(9 + er]) < tr{AD 2 iP(x\9 + er))) = e9 v tr{AH v (x)) < 0. 

This implies that 9 is a boundary point of 0. Hence Lemma H] was proved. 

A. 6 Proof of Theorem [2] 

We first recall some notations. We use [to] = {1, . . . , m} and L M = {-j^, -h-,--- , ff }• 
The supremum norm of s G Z m is defined by ||s||oo := max,,- \sj\. Recall that 
Umax = max ue ^ ||m||oo- We denote U = U max for simplicity. Recall that Km is a 
linear map on R u defined by K M 9 = {9 u j njLi(l - Uj/M)) ueU . 
Define a set K^Q by 

A7/0° := {K-/9 I 9 G 0°} = {0 I D 2 ^(x|ir M #) ^ Vx G [0, l] m }. 

Then we have K^/Q C Q° M by the definition of Q° M . Hence, the theorem follows 
from the following two claims. 
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(i) lim supine = 0°. 

(ii) S° M C 0° for any M. 

We first prove (i). Put S = {A >z | trA = 1} and f(x\6,A) = tT[AD 2 tfj(x\ 
By Lemma|5] and compactness of [0, l] m x S, a vector 9 belongs to 0° if and only if 

min f(x\9, A) > 0. 
x€[o,i]"Mes 

Now it is sufficient to prove that, for any 9 G M u , f(x\Ku9, A) converges to f{x\9, A) 
uniformly in x G [0, l] m and A G S. Let H u (x) := D 2 (—ir~ 2 YYjLi cos(jrUjXj)). Then 
we have f(x\9, A) = 1 + J2 u eu 6uto[AH u (x)] and therefore 

\f(x\K M 9,A)-f(x\9,A)\ < J2\{( K M0)u-O u }ti[AH u (x)}\. (12) 

Since the function tx[AH u {x)] of (x,A) G [0, l] m x S is bounded and since (K M 9) U 
converges to 9 U for each u G W as M — > oo, the right hand side of f[T2"j) converges to 
uniformly in x and A. 

Next we prove (ii). Let Rm = { — ^w^i ■ ■ ■ j ^w~i if}- We extend the domain of 
i/j from [0, l] m to M m as done in the proof of Lemma [U and denote it again by ip. 
If 9 £ Q° M , then D 2 iP(£\Km9) is positive definite for any £ G i?^} because ^>(x|0) is 
an even function with respect to each coordinate Xj. Then it is sufficient to prove 
that D 2 ip(x\9) for any x is written as a convex combination of {D 2 ip( K ^\KM9)}^£R m . 
Define a Fejer-type kernel Qm by 

n M 1 y^ 1 y^ 1 w(q-6) 2 1 / sin(7rM2:/2) \ 2 

(°cM\ z ) ■ 2M2 LL e 2M 2 V sin(^/2) J ' 

a=0 t=o \ v / y / 

Then the following lemma holds. 
Lemma 6. For any M > U + 1, we have 

m 

DV(^i^) = ^^(£1^)11^(^-0). 

feiis j=i 

The right hand side is a convex combination of {D 2 iP(^\Km9)}^r^. 
Proof. For each j G {1, . . . , m}, define an operator Kmj on M w by 

Q 

{Km " 9)u = 1-uj/M- 
Then we have Km = YYjLi Kmj from the definition. It is sufficient to show that 

D 2 i){x\9) = /^'••(Cr-'- , K Mtj 9)Q M {x j -^), (13) 
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where x\j = (xi, . . . , Xj-i, Xj+x, ■ ■ ■ , x m ). In fact, if f|T3|) is proved, then 
D 2 i,{x\6) = D 2 ^ 1 ,x 2 ,...,x m \K M)1 9)Q M (x 1 -^) 

2 

= Yl Yl d2 ^&...,xm\Km,iK m , 2 9)\[Qm{x -Q 

m 

= Y D 2 ^\K M 9) HQm(xj - Q. 

We prove (fT3l) for j = 1 without loss of generality. We first describe D 2 ifj(x\6) in 
terms of {e 17TsTx } s< zz m - For each s G Z m , we define a m x m matrix 

/ if s = 0, 

^2H <J ( U )I SS T if = u . for all j G [m] for some u EU, 
otherwise. 

Recall that er(tt) = {j G [m] | Uj > 0}. Then, by applying the Euler's formula 
cos(7TUjXj) = (e mu i x o — Q- mu o x i)/2 to Eq. (j2J), we can show that 



D 2 ip(x\9) = F ^ 



Recall that U = max ueW IM|oo- The right hand side of (JT3J) with j — 1 is 



SiS-Rm V |a | TO <l/ 1 17 / \ a=0 b=0 



M-l Af-1 



_p e i7r((o-6)^i+s^ 1 a:\i) ^ 



= V V V — — — V e ^ (si ~ a+b)51 

M-lsd 2M ^ 

a=0 fe=0 Hslloo^fy 1 1 6Gi?M 

M~l p e ins T x 

= YY Y M —\ s \ l ^= a - b mod 2M >- 

a=0 6=0 ||s||oo<(7 1 

For any Si with < U < M, the cardinality of the set 

{(a, b) G {0, . . . , M - l} 2 | si = a - 6} 
is M — |si|. Hence we have 

^ D 2 ^ h x v \K M ,ie)QM(xi-ti) = Y F ° elnsTx = D 2 ^)- 

£\&Rm \\s\\oo<U 
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Therefore ffl3l) was proved. 

Now we prove that {Ilj=i Qm( x 3 ~ Cj)}^eR^ becomes a probability vector. In fact, 
non- negativity follows from the definition of Q m and the total mass is 1 because 

M-lM-l 1 M-1M-1 

£ ««<*.-&) = iEES e , " < °- i,(I '- s ' ) = sEE'm, = i. 

Therefore the lemma and Theorem [2] are proved. □ 
A. 7 Proof of Theorem [3] 

Let G Ut . We show that D 2 ip(x\9) h for all x G [0, l] m . By Euler's formula, we 
obtain 

m 

JTcos^MjXj) = 2~ m cos(7ia T d(u)x), 

3=1 ae{— l,l} m 

where d(u) is the m x m diagonal matrix with the diagonal vector u. Note that 
2 ~ m Eae{-i,i}-» aaT = L Then 

Q 

D 2 i/j(x\9) = I + ^ cos(7ra T d(M):r)d(M)a:a: T <i(u) 



uGW aG{-l,l} m 



lu? 



h o. 

This implies that 9 G 0. 

Next we assume that V C W is linearly independent modulo 2. Since bt C 0, 
it is sufficient to prove that n R v C Ut nl v . Let 9 G n R v . We evaluate 
D 2 i)(x\9) at lattice points f G {0, l} m . For any f G {0, l} m and any v G Z m , we 
have 

/in \ 

= (-ir ,e d(«) a . 

Since V is linearly independent modulo 2, we can choose £ G {0, l} m such that 
t> T £ = !{0„>o} (mod 2) for all t> G V. Then 



j'=i 



cos^vrWjX^ 



o =< dvw)!^ = i + ^^(-ir T ^) 2 = i-X>k 

This means G Ut fl My 
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